Description: the implementation of "The Zebra System"
Version: 1.2.0.20210721.special
Group name: YYDS
Authors: Haodong Liu and Jichen Zhao
Airbnb has become a popular platform among holidaymakers and tourists for lodging and rental houses. A host can manage their listings, and a guest can select one that fits their personalised travel plans. A public Airbnb dataset is used for the visualisation tasks. It contains the summary info and metrics of some listings in New York City (NYC), New York, USA for 2019. The data table is stored in the CSV file Airbnb_NYC_2019.csv, which was downloaded from the corresponding dataset info page on Kaggle.
Two information visualisation "systems" have been implemented: "The Giraffe System" (hereinafter called Giraffe) and "The Zebra System" (hereinafter called Zebra). This is because the visualisation tasks are defined in the same context but with different contents. For example, both Giraffe and Zebra explore a task to consume information by analysing the data, but the specifications differ. Either way, we expect both of them to provide general insights into the NYC listings for 2019, since the visualisation tasks should help visualise and understand the primary data features and correlations.
Each item (i.e., a listing) originally has 16 attributes as follows. We would keep relevant attributes for visualisation tasks.
| Attribute | Description | Kept |
|---|---|---|
| id | The listing ID | √ |
| name | The listing name | |
| host_id | The host ID | √ |
| host_name | The host name | |
| neighbourhood_group | One of the 5 boroughs in NYC | √ |
| neighbourhood | One of the neighbourhoods in NYC | √ |
| latitude | The latitude coordinate | √ |
| longitude | The longitude coordinate | √ |
| room_type | One of the room types defined by Airbnb | √ |
| price | The price in US dollars for a night stay | √ |
| minimum_nights | The minimum number of nights that a guest can book | |
| number_of_reviews | The number of reviews | |
| last_review | The date of the latest review | |
| reviews_per_month | The number of reviews per month | √ |
| calculated_host_listings_count | The number of different listings for a particular host | √ |
| availability_365 | The number of days for which a particular listing is available in a year | |
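As a minimal sketch (not the original notebook code), the attributes marked as kept could be selected with pandas as follows. The helper name `prepare_listings` and the tiny made-up sample frame are hypothetical and only stand in for the real CSV. Filling missing `reviews_per_month` values with 0 assumes that a missing value means the listing has no review.

```python
import pandas as pd

# Attributes marked as kept in the attribute table.
KEPT_COLUMNS = [
    "id", "host_id", "neighbourhood_group", "neighbourhood",
    "latitude", "longitude", "room_type", "price",
    "reviews_per_month", "calculated_host_listings_count",
]


def prepare_listings(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep the relevant attributes and fill missing review rates with 0."""
    listings = raw[KEPT_COLUMNS].copy()
    listings["reviews_per_month"] = listings["reviews_per_month"].fillna(0)
    return listings


# Made-up rows standing in for pd.read_csv("Airbnb_NYC_2019.csv").
sample = pd.DataFrame({
    "id": [1, 2],
    "name": ["Cosy room", "Sunny loft"],
    "host_id": [10, 20],
    "host_name": ["A", "B"],
    "neighbourhood_group": ["Brooklyn", "Manhattan"],
    "neighbourhood": ["Kensington", "Midtown"],
    "latitude": [40.65, 40.75],
    "longitude": [-73.97, -73.98],
    "room_type": ["Private room", "Entire home/apt"],
    "price": [149, 225],
    "reviews_per_month": [0.21, None],
    "calculated_host_listings_count": [6, 2],
})
prepared = prepare_listings(sample)
```

With the real file, `prepare_listings(pd.read_csv("Airbnb_NYC_2019.csv"))` would be the equivalent call.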
NOTE:
- name and host_name would be removed. We already have unique IDs for listings and hosts, and we are not interested in their names. Hence, they would be dropped to also avoid any potential ethical issue.
- neighbourhood_group, neighbourhood, and room_type are categorical. This attribute type could be vital for information visualisation.
- minimum_nights and availability_365 would be removed. These attributes could be significantly subject to the host preferences, and we are not interested in such future data.
- number_of_reviews would be removed. The listings could be added at different times, and we reckon that the attribute reviews_per_month would be more meaningful. It contains missing values because a particular listing could have no review. In this case, we could simply fill these values with 0.
- last_review would be removed. We would focus on the generic trend, distribution, etc. This attribute could contribute little to the visualisation tasks, since its value could be null and we do not have another clear date for comparison.

Before visualisation, it is necessary to understand the interactive nature of charts created using Altair. In plain English, it is essential for you to take advantage of the following features.
A multi selection is similar to a single selection, but it allows for multiple chart objects to be selected at once. By default, chart elements can be added to and removed from the selection by clicking on them while holding the Shift key.
The 7 visualisation tasks are defined as follows. Zebra shares almost the same sections as Giraffe from the start up to this point because they are necessary preparations. However, the following sections could vary considerably from those of Giraffe, since we perform the same visualisation tasks using different design decisions.
| Task | Action | Specification |
|---|---|---|
| #1 | Analyse and consume | Discover the number of listings by borough and room type to find a borough with the most listings and entire rooms/apartments. |
| #2 | Analyse and produce | Derive the per cent of room type by borough to compare between the 2 categories. |
| #3 | Search | Look up the number of Manhattan's neighbourhoods in the top 10 neighbourhoods by the number of listings. |
| #4 | Search | Browse the host ranking by the number of reviews per month and the number of listings to find the host ranking first in each case. |
| #5 | Search | Locate the most popular price range for each borough/room type. |
| #6 | Search | Explore any noticeable pattern in the price distribution by room type. |
| #7 | Query | Identify, compare, and summarise the correlations among prices, locations, the number of listings, boroughs, and room types. |
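To make the "produce" part of task #2 concrete, the per cent of each room type within a borough could be derived with a normalised cross-tabulation. This is a minimal sketch on made-up rows, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up listings, only for illustration.
listings = pd.DataFrame({
    "neighbourhood_group": [
        "Manhattan", "Manhattan", "Manhattan", "Manhattan",
        "Brooklyn", "Brooklyn",
    ],
    "room_type": [
        "Entire home/apt", "Entire home/apt", "Entire home/apt", "Private room",
        "Entire home/apt", "Private room",
    ],
})

# normalize="index" divides each row by its total, so every borough sums to 100.
share = pd.crosstab(
    listings["neighbourhood_group"],
    listings["room_type"],
    normalize="index",
) * 100
```

Each row of `share` is a borough and each column is a room type, which is the shape a stacked or grouped bar chart needs after melting.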
People might be interested in questions like "who has the most...?" when it comes to comparisons. Bar charts would be a good choice. However, if there are multiple categories for grouping, we should consider whether the charts need to be stacked or grouped.
NOTE:
# Plot the grouped bar charts.
Sometimes bar charts are used for visualising the rank. We would not say that one surpasses the other, but which one might be more suitable for a specific scenario?
NOTE:
# Plot the vertical bar chart.
A ranking bar chart usually consists of a categorical attribute and a quantitative attribute that can be ordered. Is it always good practice to visualise the data in a specific order?
NOTE:
# Plot the unordered bar charts.
Line charts might be preferred when we try to visualise any relationship or trend. But we should admit that histograms could be versatile. Why not just try and compare them?
NOTE:
# Plot the line charts.
Both plots could provide insights into the distribution of a quantitative attribute. Violin plots could also tell about the density. It does not mean that the violin plots are better. But in the context of distribution, which one would be preferred?
NOTE:
# Plot the violin plot.
It is incredibly convenient to generate a heatmap based on geo-location for this dataset thanks to the latitude and longitude attributes. Selecting a suitable colour scheme would be vital for successful visualisation. We reckon that it is better to use saturation of the same hue. However, we would like to pretend to forget it and perform the specific visualisation task. XD
You live, and you learn.
NOTE:
# Plot the map part using saturation of the same hue.